Exploration of Vancouver Trees

Author : Muntakim Rahman   UBC Student Number : 71065221

Introduction

This notebook analyzes the Vancouver Trees dataset stored in the small_unique_vancouver.csv file.

Import Packages

Observe Outputs

Let's start by getting an understanding of the data sparsity (i.e. NULL values), as well as the column distributions.

Data Sparsity

There are NULL occurrences in the date_planted, plant_area, and cultivar_name columns. Let's keep these for now to visualize the data in the entries without NULL values.
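As a minimal sketch of how the NULL counts above could be obtained, the snippet below counts missing entries per column with pandas. The `trees` DataFrame here is a tiny made-up stand-in, not the real dataset, and the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the Vancouver Trees data (values are made up for illustration).
trees = pd.DataFrame(
    {
        "tree_id": [1, 2, 3, 4],
        "date_planted": ["1999-04-12", None, "2005-11-03", None],
        "plant_area": ["B", "10", None, "N"],
        "cultivar_name": [None, None, "KWANZAN", None],
    }
)

# Count NULL entries per column, and express them as a share of all rows.
null_counts = trees.isna().sum()
null_share = trees.isna().mean().round(2)

# Keep only the columns that actually contain NULL values.
sparsity = pd.DataFrame({"null_count": null_counts, "null_share": null_share})
sparsity = sparsity[sparsity["null_count"] > 0]
print(sparsity)
```

With the real data, `trees` would instead be loaded from small_unique_vancouver.csv (for example with `pd.read_csv`).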

Non-Numeric Data

Observing the data stored as objects, there seems to be variation in the number of distinct values across columns.

The std_street and on_street columns have more than 600 distinct values and would not be good candidates for the EDA.

Looking at the date_planted column, there are only 1599 distinct values in the entire dataset. This implies that planting dates repeat across entries, which is rather interesting.

The curb and root_barrier columns are binary in nature and should be one-hot encoded in our final analysis.
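The observations above could be reproduced along these lines: count the distinct values per column, then one-hot encode the binary columns. This is a sketch on made-up rows; the `Y`/`N` coding of curb and root_barrier is an assumption, and `pd.get_dummies` is just one of several ways to encode them.

```python
import pandas as pd

# Toy stand-in with made-up values; "Y"/"N" coding is assumed for illustration.
trees = pd.DataFrame(
    {
        "on_street": ["W 13TH AV", "E 5TH AV", "MAIN ST", "W 13TH AV"],
        "curb": ["Y", "N", "Y", "Y"],
        "root_barrier": ["N", "N", "Y", "N"],
    }
)

# Number of distinct values in each column.
distinct_counts = trees.nunique()
print(distinct_counts)

# One-hot encode the binary columns. drop_first=True keeps a single 0/1
# indicator per column (curb_Y, root_barrier_Y) instead of two redundant ones.
encoded = pd.get_dummies(
    trees, columns=["curb", "root_barrier"], drop_first=True, dtype=int
)
print(encoded)
```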

Numeric Data

Observing the data stored as type np.number, there seem to be differences in std deviation for given columns.

Based on the std deviation of 75412.260406, the tree_id column probably includes data for a unique identifier. We can use this to identify our trees, but it doesn't serve much other use for our EDA.

There is a very large std deviation for the civic_number column, with the min value being 2 and the max being 9113. There is similar behavior in the on_street_block column, which has very similar mean, min, and max values to civic_number. I'm not particularly interested in these columns, but we can visualize the correlation.

The height_range_id column has a mean, 25th percentile, and 50th percentile that are all approximately 2, which is interesting. I'd like to see the distribution of this column.

The latitude and longitude columns have a std deviation less than 0.1, which suggests that most trees are in the same vicinity. We can try using this data to see where trees are densely concentrated.
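A summary like the one discussed above can be sketched with `DataFrame.describe`, restricted to numeric columns. The rows below are made up for illustration, and the uniqueness check on tree_id is an added assumption about how one might confirm that it is an identifier.

```python
import numpy as np
import pandas as pd

# Toy stand-in with made-up values.
trees = pd.DataFrame(
    {
        "tree_id": [101, 20455, 150320, 98012],
        "height_range_id": [2, 2, 1, 4],
        "latitude": [49.26, 49.24, 49.28, 49.22],
        "genus_name": ["ACER", "PRUNUS", "ACER", "TILIA"],
    }
)

# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns only.
numeric_summary = trees.describe(include=np.number)
print(numeric_summary)

# A column whose values never repeat is a plausible unique identifier.
is_identifier = trees["tree_id"].is_unique
print(is_identifier)
```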

Questions of Interest

We want to explore this dataset to understand :

Columns of Interest

We are going to be visualizing the data in the following columns :

Data Transformation

Prior to visualizing the dataset, we will add a decade_planted column to give more meaning to the time periods in which trees were planted. This will also enable us to add a decade_planted filter to our visualizations.
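One possible way to derive decade_planted from date_planted is sketched below. The label format (for example `1990s`) and the helper name `to_decade` are assumptions, since the notebook's own transformation is not shown here; missing dates are left as missing.

```python
import pandas as pd

# Toy stand-in with made-up planting dates, including one missing value.
trees = pd.DataFrame(
    {"date_planted": ["1999-04-12", None, "2005-11-03", "2010-01-20"]}
)

# Parse the strings into timestamps; missing entries become NaT.
planted = pd.to_datetime(trees["date_planted"])


def to_decade(timestamp):
    """Hypothetical helper: map a timestamp to a label such as '1990s'."""
    if pd.isna(timestamp):
        return None
    return f"{timestamp.year // 10 * 10}s"


trees["decade_planted"] = planted.map(to_decade)
print(trees)
```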

Analysis

Q1 : What trees are commonly found in Vancouver?

Let's plot the count of each genus_name to visualize the most and least common trees within the city. Let's limit the genus_name values shown to the 10 most and 10 least common trees.
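The trimming step could look roughly like this: count trees per genus, then keep the head and tail of the sorted counts. The genus counts below are made up, and `n = 2` is used only so the toy example stays small (the notebook uses 10).

```python
import pandas as pd

# Toy stand-in: made-up genus counts.
genus = pd.Series(
    ["ACER"] * 5 + ["PRUNUS"] * 4 + ["TILIA"] * 3 + ["QUERCUS"] * 2 + ["GINKGO"],
    name="genus_name",
)

# value_counts sorts from most to least common.
counts = genus.value_counts()

n = 2  # the notebook uses 10; 2 keeps this toy example readable
most_common = counts.head(n)
least_common = counts.tail(n)

# Combined table used for plotting.
trimmed = pd.concat([most_common, least_common])
print(trimmed)
```

If the data had fewer than `2 * n` genera, the head and tail would overlap, so duplicates would need to be dropped before plotting.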

From `Figure 1` :

Q2 : Where are trees located in Vancouver?

Let's bin the latitude and longitude coordinates in a heatmap to visualize the tree density within a given area.
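To make the binning idea concrete, here is a small sketch using `numpy.histogram2d`, which counts points falling in each cell of a latitude/longitude grid. The coordinates are made up and the 2 x 2 grid is an arbitrary choice for illustration; a plotting library would typically perform an equivalent binning when drawing the heatmap.

```python
import numpy as np

# Made-up coordinates roughly in Vancouver's range, for illustration only.
latitude = np.array([49.21, 49.22, 49.26, 49.27, 49.27, 49.29])
longitude = np.array([-123.20, -123.18, -123.10, -123.09, -123.08, -123.03])

# Count trees in each cell of a 2 x 2 grid.
# Rows of `density` follow latitude bins; columns follow longitude bins.
density, lat_edges, lon_edges = np.histogram2d(latitude, longitude, bins=[2, 2])

print(density)
print(lat_edges, lon_edges)
```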

From `Figure 2` :

Q3 : What sizes are Vancouver trees? Is there a relationship between diameter and height range ID?

Let's plot the diameter and height_range_id columns to visualize the relationship between the two properties. This might act as a proxy for determining whether trees occupying a greater area also tend to be taller.
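Alongside the plot, a quick numeric check of the relationship could summarize diameter within each height_range_id. This is a sketch on made-up values; using the median is an assumption chosen because it is less sensitive to outliers than the mean.

```python
import pandas as pd

# Toy stand-in with made-up diameters and height range IDs.
trees = pd.DataFrame(
    {
        "diameter": [3.0, 4.0, 8.0, 10.0, 18.0, 22.0],
        "height_range_id": [1, 1, 2, 2, 4, 4],
    }
)

# Number of trees and median diameter within each height range.
diameter_by_height = trees.groupby("height_range_id")["diameter"].agg(
    ["count", "median"]
)
print(diameter_by_height)

# True when the median diameter never decreases as height_range_id grows.
increases_with_height = diameter_by_height["median"].is_monotonic_increasing
print(increases_with_height)
```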

From `Figure 3` :

Q4 : What neighbourhoods have the largest trees? What about the smallest trees?

Let's look at the breakdown of this data for both diameter and height_range_id by neighbourhood_name.
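The breakdown could be computed along these lines, using the `neighbourhood_name` spelling that appears later in this notebook. The neighbourhood labels and values below are placeholders, and ranking by mean diameter is an assumption about how "largest" might be defined.

```python
import pandas as pd

# Toy stand-in: placeholder neighbourhood labels and made-up sizes.
trees = pd.DataFrame(
    {
        "neighbourhood_name": ["AREA_A", "AREA_A", "AREA_B", "AREA_B", "AREA_C", "AREA_C"],
        "diameter": [12.0, 20.0, 6.0, 8.0, 3.0, 5.0],
        "height_range_id": [3, 5, 2, 2, 1, 1],
    }
)

# Mean diameter and height range per neighbourhood, largest diameter first.
size_by_neighbourhood = (
    trees.groupby("neighbourhood_name")[["diameter", "height_range_id"]]
    .mean()
    .sort_values("diameter", ascending=False)
)
print(size_by_neighbourhood)

largest = size_by_neighbourhood.index[0]
smallest = size_by_neighbourhood.index[-1]
print(largest, smallest)
```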

From `Figure 4` :

Q5 : How did tree sizes change by decade?

From `Figure 5` :

Further Questions

I would like to explore the data in these charts when filtered for criteria including :

A few questions start to emerge when looking at data for the columns we've considered for size, as well as trends over time.

Interactive Dashboard

Let's create a dashboard from the visuals above in order to start investigating these questions.

Visualizations

Choices

In order to visualize the density of data points, we have used both a traditional heatmap and circle plot of varying colors and sizes. The viridis color scheme enables the clear distinction of changes in density, complemented by a legend.

We have also used a histogram to visualize the number_of_trees per genus_name in the effective dataset. This acts as a simple means of displaying the breakdown of tree genera, while also acting as a dashboard filter.

To compare the diameter and height_range_id distributions, we have layered the data by decade_planted in a translucent area chart. Since we are implementing color to highlight the nominal decade_planted feature here, we are using the tableau10 color scheme.

Potential Improvements

When arranged in a dashboard, the heatmap and circle plot are not aligned. This may be visually jarring and could be improved by ensuring consistent heights and widths.

Another improvement would be to keep the axes fixed across decade_planted filter selections in the area chart. This filter can be used to help remove unnecessary decade_planted data. Fixed axes would make visual transitions on the area chart smoother as we look at the effective decade_planted values. Consequently, the area chart would also be more accommodating to users with visual impairments.

Discussion

We have intentionally decided to visualize the charts in a dashboard prior to our final discussion. This enables us to answer the subsequent questions which arose from our initial analysis.

Summary

Tree Genera

Neighbourhoods

Location

Unanswered Questions

In order to understand where trees are being planted over time, it would make sense to visualize the time-series data of number_of_trees and compare this for different neighbourhood_name values.

Another question which arises from the dashboard is whether neighbourhood_name or latitude/longitude coordinates are the better indicator for tree location. These could be better explored in a subsequent analysis. We could visualize the data in a map of Vancouver to get a clear understanding.

References

The data were obtained from the City of Vancouver's Open Data Portal and are provided under the Open Government Licence – Vancouver.

These additional resources provide the theory and code segments for the Analysis Report in this notebook :